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Abstract 

The problem of prediction in functional linear regression is conventionally addressed 
by reducing dimension via the standard principal component basis. In this paper we 
show that an alternative basis chosen through weighted least-squares, or weighted 
least-squares itself, can be more effective when the experimental errors are het- 
eroscedastic. We give a concise theoretical result which demonstrates the effective- 
ness of this approach, even when the model for the variance is inaccurate, and we 
explore the numerical properties of the method. We show too that the advantages 
of the suggested adaptive techniques are not found only in low- dimensional aspects 
of the problem; rather, they accrue almost equally among all dimensions. 

Keywords: Cross-validation; eigenfunction; eigenvector; functional data analysis; 
functional linear regression; mean squared error; orthogonal series; principal com- 
ponent analysis; rate of convergence; weighted linear regression. 
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1 Introduction 



The functional linear model has the appearance of being rather conventional. It 
involves representing a scalar response, Y, as 



where X denotes the function-valued explanatory variable, a is a scalar, (3 is the 
function-valued slope parameter, and X is a known compact interval. However, 
estimation of f3 is generally a nonparametric problem, and the level of complexity 
implicit in that property can carry over to the problem of prediction, in which 



be estimated root-n consistently, where n denotes sample size, but more commonly, 
estimators converge at strictly slower rates. Cai and Hall (2006) discuss these issues, 
and Faraway (1997), Ferraty and Vieu (2000), Cuevas et al. (2002), Ramsay and 
Silverman (2005, Chapter 12), Cardot et al. (2006) and Cardot and Sarda (2006) 
address functional linear regression in more general terms. 

A standard approach to estimating a and /5 is to first estimate the principal 
component basis from a sample of observations of {X,Y), and then construct an 
estimator of /i(x) = a + f^Px in terms of that basis, using least squares. However, 
in practice the distribution of the error in (1) can be a functional of the distribution 
of X, and the optimal choice of basis can depend significantly on x. To address 
these challenges we could construct the basis so that it gave greater emphasis to 
observations of X that were relatively close to x. For example, we could restrict 
attention to X for which ||X — x\\ < 6, where || ■ || was a suitable distance measure 
and 6 played the role of bandwidth, although 6 would not necessarily be chosen to 
converge to zero as n increased. More subtly, the basis could be constructed by 
applying kernel weights to each observation. See Mas (2008) for theoretical results 
addressing problems of this type. 

Although this approach is attractive, practical difficulties can arise from the im- 
plicit reduction in sample size that is involved. An alternative method is to estimate 
the variance, cr(X)^ say, of the distribution of the error in (1) conditional on X, and 
adapt prediction to the level of variability there. We suggest solving this problem 
by modelling a{xY as a function of a + JjP x, and using its inverse, with x replaced 




(1) 



we wish to estimate a + Jxl3 x for a given function x. Sometimes a + jjl3 x can 
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by a data value X, as a weight in the basic least-squares problem. We then show 
that calculations can be simplified by computing a new principal component basis, 
adapted to heteroscedasticity. While our approach has some similarities with the 
weighted least squares method used for finite dimensional data, it differs signifi- 
cantly due to the intrinsic nonparametric, and infinite dimensional, characters of 
functional linear regression; we quantify these issues in theoretical terms. 

In summary, this paper makes three main contributions. First, we show in sec- 
tion 2 that adaptive modification of the standard principal component basis, or 
a nearly-equivalent method based on weighted least-squares, can be advantageous 
when undertaking functional linear prediction, i.e. when estimating /u(x). Secondly, 
we suggest approximations to the value of cr(x)^, and we employ them to construct 
a second basis, this time adapted to heteroscedasticity. Then, in sections 3 and 4 we 
show that this approach can give real and effective reductions in mean squared er- 
ror, even when the model we use to estimate variance is not completely correct. An 
alternative approach would be to use a nonparametric method to estimate variance. 
However, unsurprisingly given the high degree of noise associated with estimating 
nonparametric functions, numerical work shows that parametric methods are prefer- 
able. These results all have analogues in cases where F is a multivariate response, 
although for simplicity and transparency we focus only on the univariate case. 

The main theoretical result in section 3 gives a concise account of the way adap- 
tive methods can improve the performance of estimators in functional linear regres- 
sion. In particular, we show that the advantages accrue almost equally among all 
dimensions; they are not principally to be found in low- dimensional aspects of the 
problem. 

Previous developments of principal components analysis for functional data play 
a central role in our work. Early contributions include those of Besse and Ramsay 
(1986), Ramsay and Dalzell (1991) and Rice and Silverman (1991). From that point 
a very substantial literature has developed, including but by no means limited to 
the work of Silverman (1995, 1996), Brumback and Rice (1998), Cardot et al. (1999, 
2000, 2003), Cardot (2000), Girard (2000), James et al. (2000), Boente and Fraiman 
(2002), He, Miiller and Wang (2003), Ramsay and Silverman (2005, Chapters 8-10), 
Yao et al. (2005), Hall and Hosseini-Nasab (2006), Jank and Shmueli (2006), Ocafia 
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et al. (2007), Reiss and Ogden (2007) and Huang et al. (2008). 



2 Methodology 

2.1 Orthogonal series approach to inference in the hnear 
model 

The functional linear model argues that independent data pairs (-^[i], ^i), • • • , {^[n], 
Yn), distributed as {X,Y), are generated as 

Y = a + j pX + e, (2) 

where a is a scalar, j3 and X are functions defined on the compact interval X, and 
E{e I X) = 0. Square-bracketed subscripts here distinguish the ith observation of 
X, from the zth principal component score, which is conventionally represented 
by Xi. The prediction problem is that of estimating /i(x) = E{Y \X = x) = 
a + JjPx with (a,/5) at (2), where x denotes a particular value of X and is a 
scalar functional. 

A standard approach to estimating /i(x) is to introduce an orthonormal basis, 
say il)i,il)2i ■ ■ -i and argue that (3 and x admit convergent expansions with respect to 
this sequence, i.e. 



oo 



p = ^bjijj, x = ^ Xj ijjj , ^i{x) =a + ^hj Xj , (3) 
j=i j=i j=i 

where bj = JjP ipj and xj = JjX ipj . Estimators d of a and bj of bj , for j > 1 , are 
then constructed from the data by minimising 



Sr{a, bi, . . . ,br) = ^ ( - a - ^ bj Xij j 

i=l ^ j=l ^ 



(4) 



where X^ = j ipj and r denotes the frequency cut-off, a smoothing parameter. 
These definitions of a and bi, . . . ,br reflect the definitions of a and f3 at (2) and, for 
appropriate choice of r, ensure consistency. The resulting estimator of fi is 

r 

/i(x) = d + bj Xj . (5) 

A thresholding method could also be used instead of "cut-off smoothing," but the 
difficulty of estimating the variance of bj makes that approach unattractive. 



2.2 Principal component basis 

It is common to take ipi,ilj2, ■ ■ ■ to be the principal component basis, ordered so 
that the corresponding eigenvalues form a decreasing sequence. Specifically, define 
K{s,t) = cov {X (s) , X (t)} to be the covariance function of X, and construct the 
spectral decomposition of K, 

oo 

K{s,t) = J20,i^MUt), (6) 

where Oi > 62 > ■ ■ ■ > and {Oj,'ipj) are the (eigenvalue, eigenfunction) pairs of 
the transformation that takes ip to Kip, defined by {Kil)){t) = jj K{s,t) tpls) ds. 
Then the orthonormal functions ipj make up the principal component basis. The 
j'th uncentred principal component score of X is Xj = j^X ipj. 

In practice the principal component basis is unknown, and needs to be estimated 
from data. To this end we define 

n 00 

k{s, t) = n-' J2 i^d') - ^(^)} {^W W -Xm = Y. ^:'(^) W ' 
i=i j=i 

where X = Yli -^[i]^ K(s,t) is an estimator of K{s,t), (6j,ipj) are (eigenvalue, 

eigenfunction) pairs for the transformation represented by K, and the order of the 

indices j is chosen to ensure that Oi > 62 > . . . Then 9j and ipj are our estimators 

of 6j and ipj, respectively, and we would replace (4) by 

n / r V 2 

^^.(a, 61, . . . , 6^) = ^ ( - a - ^ hj Xij j , (7) 

i=l ^ j=l ' 

where Xij = JjX^i^tjjj, giving the obvious estimator /i(x) of /i(x). Equivalently, 
since a = Y — X then, writing Xj = Y2i -^ij^ "we can minimise 

i=l ^ j=l 

over bi, . . . ,br, obtaining the same numerical values 61, . . . , 6^- as we do when min- 
imising (7). Then, defining xj = JjXipj, we take 

r 

l2{x) = F + 5^ 6, (x, - X,) (9) 

i=i 

to be our estimator of fi{x). In a slight abuse of notation, when discussing practical 
implementation we shall write a and bj for the quantities that minimise (7) rather 
than (4). 
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2.3 Adapting to the variance of Y 

The estimator /I at (9) is conventional, but does not take into account the fact that 
the errors at (2) are often heteroscedastic. When a significant amount of variabihty 
is explained by that aspect of the problem, we should replace Sr{ot^ bi, . . . ,br) at (7) 
by its form where a weight, equal to an approximation to the inverse of the variance 
of Yi — a — jj(3 conditional on Xjj], is incorporated into the series at (7). 

In conventional parametric regression, the conditional variance of the regression 
errors is often modelled as a power of the assumed parametric form of E(Y\X). See 
for example Carroll and Ruppert (1988). In the functional data context we propose 
modeling var(e | X) by 

aiXy = g(a+y]b,X,], (10) 



3=1 ^ 



with g a univariate function and a, bi, . . . ,br and r as in (7) and where Xj is the 
jth principal component score. An adaptive form of g that is often suitable is the 
"power of the mean" model 

g{u) = Icu^, (11) 

where Ci and C2 are constants, or the version of it which includes an intercept term. 
See Carroll and Ruppert (1988, pp. 5, 65). 

In principle, a nonparametric estimator of g could be used. However, in appli- 
cations to real functional data, where sample sizes are generally only moderate in 
size, we found that the increase in variance resulting from the nonparametric fit 
significantly outweighed any improvements in performance that might be expected 
using that approach. Bear in mind that it is not necessary to have consistent esti- 
mators of the variances in order to enjoy improved statistical performance, even in 
the asymptotic limit. We shall take this point up again in section 3; see point (ii) 
below the Theorem there. 

To estimate cr{XY we interpret the unweighted estimators a and (3 = X]j<r "^i 
as pilot estimators of a and j3 = J2j i^jy respectively, and use them to calculate 
residuals q = Fj — d — bjXij. Since these quantities are already centred, we 
define ^ 

T(ci, c) = J2 ft - cJa + Y, h I (12) 

i=l ^ 7 = 1 ^ ^ 
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and choose Ci and C2 to minimise T(ci, C2). In this notation our estimator of var(y - 
a — fj/3X\X = x) is, when x = 



ci I a 



j=i / 

Next we incorporate these weights into the objective function at (8), obtaining: 

n c t ^. 2 

where = {E. wiX^^)}-' ^(^w)l^. and X,-^ = {E. E. ^^(^h) 

Xij] and we choose 6^1, . . . , to minimise Utipi, . . . ,bt). A new estimator of 
is given by the following analogue of (5), based on the new coefficient estimators: 

t 

Jlnj{x) = + E (% - Xj^^) . (14) 
i=i 

A computational advantage of defining estimators by minimising Sr{a,bi, . . . ,br) 
at (4), rather than Ut{bi, . . . , bt) at (13), is that the "ex transpose ex" matrix in the 
former case is simple to invert. Indeed, by definition of Xij in terms of the orthogonal 
functions V^j, the matrix with (j, A;)th term Ei {Xij~-Xj) i^ik — Xk) is diagonal. 
The fact that this does not hold in the case of the objective function Ut{bi, . . . , bt) 
refiects the fact that the orthonormal basis functions tpj are not necessarily, in this 
case, the natural ones. Instead we could replace Ut{bi, . . . , bt) by 

n s s 2 

i=l ^ j = l ^ 

where we define Xij = and Xj^^ = «^(A^[i])}~"^ Ei w{Xiq)Xij, and 

where the orthonormal functions 0i,02,..., with corresponding eigenvalues Qi > 
0)2 > . . ., are defined by the following spectral decomposition 

n s —1 n 00 

\ 5^ W](.)-x(.)}{X[,](t)-x(t)}^?(X[,]) = %0,(.)0,(t). 

t=l ^ i=l j=l 

Taking bwi, . . . , b^s to minimise Vs{bi, ...,6s), a competitor with Jlw{x) at (14) is 
given by 

fi^{x) = Yw + Y (y ^j,^ ■ (16) 

The numerical differences between /i^ and generally very small. 



2.4 Practical choice of smoothing parameters 

The methodology outhned in sections 2.2 and 2.3 involves two smoothing param- 
eters: r, in the equivalent objective functions 5*^ and 5^^^"^ at (7) and (8), and t, 
in Ut at (13), or s, in Vg at (15). We propose selecting these parameters by cross- 
validation, as follows. Omit the data pair {X^^,Yi) from the sample, and, using the 
remaining n — 1 pairs, construct the predictor jlw{x) at (16) for a general r and s; 
denote it by jlw_i{x | r, s). Put W{r, s) = J2i V^i ~ i^w-ii^m I s)}^, and choose 
(r, s) to minimise W{r,s). The same approach is used to select r and t for the 
predictor Jiw^x) at (14). 

3 Theoretical properties 

We shall suppose that we model the variance var(e | X) = (y{XY as r(X)^, where the 
function r is known but may not equal a. That is, our model may not actually be 
correct. We shall make three simplifying assumptions: (a) The principal components 
JjXipj of X are independent, rather than merely uncorrelated; (b) the principal 

component basis ipi,ilJ2, ■ ■ ■ is known; and (c): 

e = <7{X) 6, where 6 is stochastically independent of X, E{S) = 0, 
E{S^) = 1, the functional a is bounded, r is bounded above zero, and 
(t{X) and t{X) depend on only a finite number of the principal compo- 
nent scores Xj = J^X ipj; that is, for some t > 1 and positive integers ^^""^ 
ji, . . . ,jt we can write cr(X)^ = var(e | X) = h{Xj^, . . . , X^J, where h 
is a positive, t-variate function which is bounded away from zero and 
infinity, and t(X)^ can be represented in the same way. 

Assumption (a) can be relaxed to a mixing condition, and (b) can be removed by 
using the estimated ipjS rather than their true forms. The latter approach requires 
a regularity condition on the spacings of the eigenvalues and in that setting 
a complex technical argument, similar to those given by Hall and Hosseini-Nasab 
(2008), is needed. Condition (c) can be relaxed by noting that if cr(-) is sufficiently 
regular, and if the scores Xj are independent, then (t{X) can be approximated by 
a sequence of functions at{Xi, . . . , Xt), for t > 1, where cr^X) — at{Xi, . . . , Xt) 
converges to zero as t ^ cxd, with a similar constraint imposed on r(X). 

Recall that the pair (X, Y) is generated by the model at (2), where the error, e, 
has zero mean, and we wish to estimate /i(x) = a + jj(3 x for a particular function x. 
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Our estimator, which is equivalent to that given at (14) with w = r(X[j]) ^, is defined 
by fiM = Y^ + J2j<r h {xj-Xj^J, where = r(X[i])-2}-i J2^ ^{^[i])'^ 
and Xj^^ = TiX^i])""^}"^ J2i '^(-^[i])"^ Xij and 61, 62, • • • are chosen to minimise 

n y- r 

i=l j=l 

Since we centre the principal component scores Xij at their respective means 
then we may, and do, assume without loss of generality that E{X t{X)~^} = 0. 

The eigenfunctions ipj and eigenvalues 9j are defined by (6), and, in addition to 
(17), we assume that: 

the principal components fjX ipj are independent , (19) 

Er=i ^1 = 00 , J^P'<oo, (20) 

r = r{n) ^ 00 as n ^ 00 , and r = 0(n~''"^^^/^)) for some r] G (0, |) , (21) 

< 00 , sup^.>i ej'' E{J^X ijjf'" < 00 for each integer k>l. (22) 

Write AMSE{/I^(x) — /i(x)} for the asymptotic mean squared error of the esti- 
mator. The following theorem, which is derived in section 5, describes asymptotic 
properties of this quantity. 

Theorem 3.1. // (17) and (19)-(22) hold then as n and r diverge together, 
AMSE{/.Ua:) - Mx)} = ^jg^^^^^i^^^"^ ± Oj^ x] + ( f^^ 6, . (23) 

Among the implications that can be drawn from (23) are the following: 

(i) If the model, r^, for the variance, o"^, is essentially correct, i.e. if r equals a con- 
stant multiple of cr, then the factor, = E{a{Zf t{Z)-^} [E{r(Z)-2}]-2, outside 
the first term in (23), which represents the variance contribution to asymptotic mean 
squared error, reduces to simply [£'{cr(X)~^}]~^; whereas that factor would be sim- 
ply E{a{Xy'} if we were to use unweighted least-squares, i.e. if we were to take t{X) 
to be constant. The fact that, by Jensen's inequality, [E{a{X)~'^}Y^ < E{a{XY}, 
demonstrates the effectiveness of the adaptive approach. 

(ii) If the model is essentially incorrect, i.e. if r does not equal a constant multiple 
of a, then the estimator remains consistent and enjoys the same convergence rate 
as before, but with an inflated constant multiplier. More generally, if the variance 
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functional is not constant, and if the model is wrong but approximately correct 
(in particular, if r(X) is sufficiently close to (j{X) for sufficiently many values of X), 
then is reduced relative to the value it would have if we were to simply take r = 1. 
These properties point to the fact that it is not essential to use a nonparametric 
estimator of variance in order to obtain an improvement in performance over stan- 
dard least-squares. As noted in section 2.3, for the small to moderate sample sizes 
commonly encountered with functional data, weighted least-squares based on non- 
parametric estimators of variance generally perform poorly because the estimated 
weights introduce too much extra variability. 

(iii) The factor p^, defined in (i) above, is applied to each and every term in the 
series Ylij<r ^'j (2^)' ^'^^ reduce in size as j increases. Therefore 
the advantages of correcting for heteroscedasticity are valid with equal force for 
arbitrarily large dimension; they do not relate just to low-dimensional aspects of the 
problem. 

(iv) As is to be expected, the effect of weighting has an impact only on the variance 
contribution to asymptotic mean squared error, not on the bias component. How- 
ever, even if the problem is finite-dimensional the impact of the variance component 
persists even in the asymptotic limit, and so there is always something to be gained, 
in asymptotic terms, by adapting the estimator appropriately to heteroscedasticity. 

(v) The first part of (20) determines that the estimator yU^(x) has nonparametric, 
rather than parametric, convergence rates. It holds if we treat x as a realisation of 
X, and average over all such realisations. In particular, if x is distributed like X 
then E{x]) = 9j, and so E(J2j>i ^J^ x]) = Ej 

1 — oo, implying that the first 

part of (20) holds "on average." 

4 Numerical illustrations 
4.1 Real data example 

We applied our method to Australian rainfall data available at http: //www. world 
weather . org. The data consist of rainfall measurements at each of 206 Australian 
weather stations, averaged over the 40 to 100 year periods for which the weather 
stations had been in use. For any given station we took Y to equal the average 
yearly rainfall divided by the average number of days on which it had rained, which 
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we refer to as intensity. Our predictor X{t) equalled the rainfall at time t, where 
t represented the fraction of the year that had passed at the time of measurement, 
and rainfall at that time was averaged over the years for which the station had been 
operating and was obtained by passing a local polynomial smoother through discrete 
observations. 

The majority of weather stations fall into one of two classes, which respectively 
comprise most stations in southern parts of the continent, which tend to follow 
a "European" rainfall pattern where the majority of rain comes in cooler months 
and summer is relatively dry; and most stations in northern regions, which exhibit 
a "tropical" pattern where most rain falls in mid to late summer and the cooler 
months are generally dry. Only a small number of weather stations have more 
complex rainfall patterns that are not of one of these two types, although some 
northern stations reflect rainfall southern patterns, and vice versa. These features 
suggest that most of the data might reasonably be assumed to come from a mixture 
of two populations. Those populations might produce different error variances in 
the linear model, leading to heteroscedasticity. 

We removed two weather stations, Ipswich and Katherine, which appeared to 
be outliers, in the sense that the functional linear model explained their rainfall 
relatively poorly. Then we applied our method to the n = 204 remaining stations. 
To test the method we generated B = 500 samples, each of size n = 102, by randomly 
removing half of the 204 stations. For each of the B samples we then applied our 
method to predict the value of Y for each of the 102 removed stations. Note that, 
since these were real data, we did not know the true value of the target /i, and so we 
compared the predicted value with the true value of Y. For each of the B samples 
we calculated the mean squared errors for the 102 predicted stations, that is, for 
b = 1, . . . , B, we calculated 

^ 102 102 

i=l i=l 

102 

and MSE, = — 

i=l 

For each of the B samples, we also calculated the proportion of the 102 pre- 
dicted stations that were better predicted by the weighted methods, that is, for 
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Figure 1: Plot of the 500 ordered values of log(MSE^,6/MSEb), top left, log(MSE^,b 
/MSEfo), top right, pw,b, bottom left, and of pw,b, bottom right. Horizontal lines are 
for reference only. 



6 = 1, ... ,5, we calculated p^^b = - Y^'^ < - Y^'^}/102 and 

In Figure 1 we present graphs of the resulting B = 500 ordered values of 
log(MSE,^,6/MSE5), of log(MSE^,6/MSEb), of p^^b and of p.^^^- We see that both 
weighted methods gave very similar results, and that both strongly bettered the un- 
weighted predictor /t: for most of the 500 samples, the MSE of the weighted methods 
was inferior to that of the unweighted method. Moreover, in more than 90% of the 
cases, the proportions pw^b and p.^j^b were higher than 0.5, meaning that for most of 
the B = 500 samples, more than half of the 102 predicted values were closer to the 
true Yi when using the weighted method rather than the unweighted method. 

4.2 Simulations 

Using the same 204 functions X{t) as in section 3.1, we also tested the weighted 
methods on some generated Y data. The advantage of simulated data is that we 
know the target fi{x) = E(Y\X = x), and thus it is easier to assess the quality of the 
predictors. We generated 204 F- values according to the model at (2), where, we took 
a = 0, (3{t) = 0.02 ■ sin{8 - (7r/20t)}(l{t<i9o} + 0.5 ■ l{t>i9o}) and e = f{X)U where 
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Index Index 

Figure 2: Plot of the 500 ordered values of log(MSE^^fe/MSEb), left, and of pw,b, 
right, for the generated data with /(X)^ as in (z). Horizontal lines are for reference 
only. 




100 200 300 400 500 100 200 300 400 
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Figure 3: Plot of the 500 ordered values of log(MSE^ h/MSEb), left, and of Pw,b, 
right, for the generated data with f{X)^ as in [ii). Horizontal lines are for reference 
only. 



U ~ f/[-3/4, 3/4]. We tried two models for f{Xf: (z) f{Xf = 0.1-{/ /3{t)X{t) dt}^ 
and (ii) f{Xf = 0.1 ■ {/ f3{t)X{t) dt}^ + 0.2 ■ | / (3{t)X{t) dt\^/^. Note that the 
function f] as defined above is a smoothed and simplified version of the estimated /3 
of the real data of section 3.1. 

We proceeded as in section 3.1 and, by randomly splitting the data (-^[i], Yi) into 
two parts, constructing B = 500 samples of size n = 102, and each time applying 
the method to the 102 remaining data points. In both cases (i) and (ii) we took 
the function g at (10) equal to g{u) = Iciul'^^^ Even though this form is only an 
approximation to the real g for case {ii), we shall see below that the weighted 
methods worked well there too. 

In this case since we knew the target /i, we replaced Yi by in the definitions 

of MSE^^fe, MSEft, MSE^„ p^^b and Pw,b- Figures 2 and 3 show, respectively, the 
results for cases (i) and (ii). We show only the predictor /i^; the results for fi^ 
were similar. The figures illustrate the improvement that can be gained by using a 
weighted version of the predictor. 
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5 Proof of Theorem 



We give the result first in the homoscedastic case, where we take both a and r to 
equal constants, and then we generalise it to the heteroscedastic setting. The model 
(2) can be written equivalently a.sY = a + Jj(3 {X — EX) + e, for the same function 
f3 but for a different scalar a, which now equals E{Y). We shall work with this 
model below. The least-squares estimator of is the same as before, but the 
corresponding estimator of a is now simply a = Y. In particular, using the new 
model and making assumptions (20) and (22), a is root-n consistent for a: 

a - a = Op{n-^^^) . (24) 

The least-squares estimators bi, . . . ,br are the solutions of 

S{k,...,brf = S, (25) 

where S = (sjua) is an r x r matrix, s = (si, . . . , Sr)"^ is an r-vector, 

^ n 1 " 

i=l 1=1 

Xij = J ipj, Xj = Xij and Y = J2i Without loss of generality, 

each E{Xij) = 0. Put Zij = XijOJ^^'^. Then the variables Zij have zero mean and 
unit variance, and Zi^j-^ and Zi^j^ are independent for arbitrary ii, 12 and for ji 7^ 72 
(see (19)). In this notation, Sj^j^ = {dj^ (^j2Y^'^ijd2 Sj = Oy^ij, where 



1=1 

- J2 - Zj) {Y,-Y) = -J2 - Zj) / P (X[,] - X) + 6, - 6 

i=i i=i ^-^i J 

00 1 " 

^ bk 6l^^ ijk + Uj , % = - X] {Zij - Zj) {ei - e) , 

k=l i=l 



n 



Zj = ^Zli Zij and e = n ^ Xij ^i- Define too vj = X]fc>r+i h^l^'^ijk, and 

3I/2 /Jl/2^ 



put T = (tjija) ^iid -D = diag(6']^' ,...,6r ), denoting r x r matrices, and -u = 
(ui, . . . , Ur)"^, V = {vi, . . . , Vr)"^, b = (61, . . . , br)^ and b = (61, . . . , fcr)"^, representing 
r X 1 vectors. In this notation, (25) is equivalent to T D ib — b) = u + v. De- 
fine II A IP = XI ji '^iii2' shall show shortly that ||y4|| in probability as 
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n oo; see the paragraph containing (29). Therefore the probabihty that T = I + A 
is invertible converges to 1. When T is invertible, 



fJ'w{x) — fi{x) — {a — a) + bj Xj = (bj — b 

j=r+l j=l 



'3 1 ^3 



(27) 



Write t 



5j,j, where 5j^j^ denotes the Kronecker delta and A= [a 



J\32 "JlJ2 I ""^1^2 5 



is an r X r random matrix. Assuming that v 



in probabihty, 



f~^ = {I + AY^ = I - A + ... + {-lf~^ A^^^ + {-lf [I - A + - ...) 
= I - A + . . . + (-1)''-^ A''-^ + A'' Ak , 

where Ak = (o/cjija) denotes a random matrix and \\Ak\\ < (1 — i^)^^- Therefore, 
writing A'^ = (ctjj'jj), and defining = ^j<r 6j^, we have. 



(r 
J2[ D-^ {f-^ -I + A- + A^"^]{ 

.7 = 1 

• r 

{D-'A'A,{u + v)]^x 



U + V] 



Xj 



r 



r r f r r 

< CrY^ [{A'' Ak {U + V)}j\'^ = CrY \ 4i! ^k,nj2 {u + V)j2 

j = l j = l ^ ji = l j2 = l 

j = l ji = l J ji = l j2 = l J 

r 

< C,||A'=||'Pfcf ^{(^ + t)),}2. 



(28) 



Assumptions (19) and (22) imply that, for each integer i > 1, 



i<Ji,j2<r 



(29) 



Therefore, £'(||A'''|p) = 0{(r^/?i)'^}, and so, since (21) implies that r/n^^'^ 0, 
we have z/ — > in probability. The property rjrt^l'^ — > also entails ||Afc|| = 



Op(l). Furthermore, £'( 



.1/2 



0(n ^) uniformly in j > 1, and if 1 < J < r 



then \vj\ = I X]fc>r+i ^fc ^fe '^jfel- '^^^ latter result, (22), (29) and the fact that 
(E, < iZj b]) (Ej Gj) < imply that E(^)|) = O(n^i), uniformly in 

1 < J < (Note that, in view of the second part of (20), b'j < oo, and by (22), 
E\\Xr = Z,Oj<oo.) 
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Combining these results we deduce that the right hand side of (28) equals 
Op{cr (r7n)'rn"i} = Op(c, r^'^+i ^"(^+1)) . 
Hence, by (27) and (28), 

r 

fi^ix) - fiix) - (a - a) = [{/-A + ... + (-l)^-M'=-i} (« + {)) 

oo 
j=r+l 

Using the fact that r^/n — > it can be shown by direct calculation that, for each 
integer k> 1, 



E 



(31) 



Taking k arbitrarily large, and using (30), (31) and the fact that r = 0{n-'i+^'^/'^'^) 
for some r/ > (see (21)), we deduce that. 



yU^(x) — /i(x) — {a — a) = V — hj Xj + o[c"J'^ n 



j=r+l 



where V = ^j<r ^^'^ + 

Note too that, since we are addressing the homoscedastic case, 

r r 

ii=i i2=i 

Now, Eitj^j^) = (1 — n^^) Sjj^ and, recalling that each E{Zij) = 0, 



-^{(^lii - Zji) (^ifci - ^fci) (^1^2 - ^12) (^ifc2 - 
= (1 — n~^) E{Zij^ Ziki Zij^ Zik2) 

= (1 — n~^) 5j^j2 Sk^k2 ) 



(32) 



(33) 



using the properties ji,j2 < ''^ and ki,k2 > t + 1, and the fact that the Zijs are 
independent. Hence, 



00 00 



'i2fe2y 



fci=r+l fc2=r'+l 

n""*^ (1 — n~^)^ 6j-^j2 dr 



(34) 
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where dr = J2k>r+i ^1 ^k- Using (33), (34) and the fact that (i^ — ^ as r ^ oo, we 
deduce that, as r and n diverge together, 

E{V^) = - ^ ej^ x] { (1 - n-^) + (1 - n^f rf,} ~ c, . (35) 

In view of the first part of (20), — >■ oo as r — >■ oo. Formula (23), but with 
[-E'{(t(X)~^}]~^ replaced by a^, follows from this property, (24), (32) and (35). 

Next we outline the argument that extends this result to the heteroscedastic 
setting. First we discuss a version of the theorem in an artificial problem where 
the error variance is a function of Z, say, which is independent of (X, F) but is 
observed along with that pair. That is, the model (2) now has the form Y = 
a + jj(3X + a{Z)6, where the perturbation 6 is independent of X and Z and 
has zero mean and unit variance. The appropriately weighted criterion function 
is that at (18) but with r(X[j]) replaced by r(Zj). In this case the proof above is 
easily re- worked, in particular with the factor r(Zj)~^ included in both series at (26) 
and in subsequent series, to show that the asymptotic mean squared error of /i«,(x) 
continues to be given by (23) but with E{a{XfT{Xy^} [E{r(X)-2}]-2 replaced 
by E{(T{Zy r{Z)"'^} [E{r{Z)~'^}]~^ . To appreciate the origins of this result, note 
that in the simpler model where /3 vanishes and Y = a + <y{Z) 5, the variance of 
the weighted least-squares estimator of a is exactly p{nY = E[{'^- aiZi)^ T{Zi)~^} 

under the assumption in (17) that (t{z) is bounded and t{z) 
is bounded away from zero, p(n)^ ~ E{a{Zy t{Z)^^} [£'{r(Z)^^}]~^. 

This result continues to hold when a and r are functions of X rather than 
Z, and depend on only a finite number of principal component scores. The proof 
proceeds by noting first that if var(e | X) = h{Xj-^, . . . ,Xj^), as in (17), and if we 
assume that the components with indices ji, ■ ■ ■ ,jt are known and therefore do not 
need to be estimated, then we are in exactly the position addressed in the previous 
paragraph; we can take Z to be [Xj-^, . . . , Xj^) and replace X by the expansion 
Y^'j Xj ipj where the summation Yl'j is over only those indices j not included among 
ji,...,jt. Moreover, the asymptotic mean squared error formula is unaffected if 
we eliminate the components corresponding to j = ji, . . . or if we take those 
components to be known. 
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